Previous studies have successfully employed classification algorithms to detect unknown malicious code. Most of these studies represented the inspected files with features based on byte n-gram patterns. In this study we represent the inspected files using MLI n-gram patterns, which are extracted from the files after disassembly and used as features for the classification process. The main goal of the classification process is to detect unknown malware within a set of suspected files, so that the detected malware can later be included in antivirus software as signatures. A rigorous evaluation was performed using a test collection comprising more than 36,500 files, in which various settings of MLI n-gram patterns, with various n-gram sizes and representations, and eight types of classifiers were evaluated. A typical problem in this domain is the imbalance problem, in which the distribution of the classes in real life varies. Class imbalance has been observed to cause a significant deterioration in the performance of existing learning and classification systems, and it is often found in real-world data describing infrequent but important cases. We investigated the imbalance problem with reference to several real-life scenarios in which malicious files are expected to make up about 15% of the total inspected files. Lastly, we present a chronological evaluation that examines how frequently the training set needs to be updated. The evaluation results indicate that the evaluated methodology achieves a higher level of accuracy, which slightly improves on the results of previous studies that use the byte n-gram representation.
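To make the n-gram representation concrete, the following is a minimal sketch of how n-gram patterns over a disassembled instruction sequence could be turned into a feature vector. It is an illustration under stated assumptions, not the procedure used in the study: the token strings, the function names `ngram_counts` and `feature_vector`, the fixed vocabulary, and the normalized term-frequency weighting are all hypothetical choices made for the example.

```python
from collections import Counter
from typing import List, Sequence, Tuple

NGram = Tuple[str, ...]


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count every contiguous run of n tokens in the sequence."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence shorter than n yields an empty range and thus no n-grams.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def feature_vector(tokens: Sequence[str], n: int,
                   vocabulary: Sequence[NGram]) -> List[float]:
    """Map a token sequence to relative n-gram frequencies over a vocabulary.

    Assumption: features are normalized term frequencies; the abstract does
    not state which weighting the study actually uses.
    """
    counts = ngram_counts(tokens, n)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(vocabulary)
    # Counter returns 0 for n-grams that never occur in this file.
    return [counts[gram] / total for gram in vocabulary]


# Hypothetical instruction tokens for one disassembled file.
tokens = ["push", "mov", "push", "mov", "call"]
# Hypothetical vocabulary of 2-grams selected as features.
vocabulary = [("push", "mov"), ("mov", "call"), ("call", "ret")]
vector = feature_vector(tokens, 2, vocabulary)
```

Each file then becomes one fixed-length numeric vector, which is the form a standard classifier expects. Restricting the vector to a selected vocabulary keeps the dimensionality manageable, since the number of distinct n-grams grows quickly with n.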